Abstract

This paper describes a novel framework for automated 
marmoset vocalization detection and classification from within 
long audio streams recorded in a noisy animal room, where 
multiple marmosets are housed. To overcome the challenge of 
limited manually annotated data, we implemented a data 
augmentation method using only a small number of labeled 
vocalizations. The feature sets chosen have the desirable 
property of capturing characteristics of the signals that are 
useful in both identifying and distinguishing marmoset 
vocalizations. Unlike many previous methods, feature 
extraction, call detection, and call classification in our system 
are completely automated. The system maintains a detection
rate of 80% on data with a high number of noise events and
achieves a classification error of 15%. Performance can be
further improved with additional
labeled training data. Because this extensible system is capable 
of identifying both positive and negative welfare indicators, it 
provides a powerful framework for non-human primate 
welfare monitoring as well as behavior assessment. 

Index Terms: Automated detection and classification, 
marmoset vocalization, primate behavioral analysis, primate 
welfare monitoring, Teager energy operator 

1. Introduction 

The common marmoset (Callithrix jacchus) is a small New
World primate that is emerging as an important non-human
primate model for neuroscience research [1]-[3]. In addition to
their small size, fast maturation, high fecundity, low
maintenance, and genetic similarity to humans [4][5], one
distinctive feature of marmosets is their large repertoire of 
vocal behaviors, making them an attractive model for studying 
the origins and neural basis of human language. Vocalizations 
belonging to the same species, or Conspecific Vocalizations 
(CVs), are crucial for social interactions, reproductive success, 
and survival [6]. Marmosets employ their vocalizations to
contact other group members, to indicate submissiveness,
aggressiveness, anger, and fear, and to alert other group
members to varying degrees and types of threats [7]. In spite
of recent
efforts to provide a quantitative acoustic analysis [8]-[10], 
there still remains no consensus as to the vocal repertoire of 
the common marmoset. 

A major challenge in utilizing vocalizations for analyzing 
animal behavior is the time and skills required to monitor and 
identify vocalization production by hand. Due to the amount
of training required, it is difficult to crowdsource this task.
Advances in machine learning have spurred a recent push to
automate vocalization monitoring in a range of species. Such
efforts have been used to classify the vocalizations of birds
[11], African elephants [12], killer whales [13], and marmosets
[8]. Recent work on semi-automated marmoset vocalization
classification [10] is primarily based on the use of short-time 
spectral analysis, which requires the explicit estimation of the 
temporal features derived from this representation. 

In this paper we introduce a novel framework for 
automated detection and classification of positive, negative, 
and neutral welfare indicators using data recorded by
microphone collars worn by marmosets in their home cage with
background cage noise. The emphasis here is on a fully
automated system for capturing naturalistic vocal behaviors. 
This is in contrast to more common approaches of recording 
short testing sessions with manual or semi-automated analysis. 

This paper is outlined as follows: Section 2 describes the 
system architecture, including feature selections. Section 3 
provides preliminary results achieved on a semi-synthetic 
dataset designed to realistically model the actual audio data. 
Section 4 discusses potential future expansions of the system. 

2. System Layout 

The proposed system architecture is divided into three main 
modules. Section 2.1 introduces the set of features used. 
Section 2.2 describes the detection procedure, and Section 2.3
describes the approach for classifying a predefined number N
of vocalization types (N = 4 in this case).

2.1. Features 

Figure 1 shows the spectrograms of four marmoset
vocalizations, which are the focus of this work. Trill is a
positive welfare indicator, while phee and twitter are
considered ambiguous, and chatter is considered a negative
welfare indicator.

Figure 1: Spectrograms of four marmoset vocalizations

A wide variety of features useful in analyzing human speech
and other animal vocalizations are explored in this paper. The
first is the basic set of six audio features described in [14][15],
which measure statistics based on energy entropy, signal 
energy, zero crossing rate, spectral rolloff, spectral centroid, 
and spectral flux. This feature set is augmented with their 
pairwise variability, which is the mean of the absolute value of 
the derivatives of each feature. In this paper, all the features 
described above are referred to as the Audio Toolbox features. 
Next we extract from Mel-Frequency Cepstral Coefficients 
(MFCC) a feature set that includes the mean of the coefficients 
along with their first and second derivatives, as well as the 
variance, skewness, and kurtosis. Finally, in an effort to
capture the rapid changes in frequency found in marmoset
vocalizations such as twitters and trills, we consider the Teager
energy operator (TEO) [16]. The TEO has been used in a
number of speech applications including automatic speech 
recognition [17], speech enhancement [18], voice activity 
detection [19], hyper-nasality detection [20], and emotion 
recognition [21]. More recently the TEO has been employed in 
the detection and classification of toothed whale vocalizations 
[22]-[24]. Despite the effectiveness of the TEO in vocalization
analysis for marine life, its effectiveness for analyzing the
vocalizations of non-human primates remains largely
unexplored. To capture the temporal variations in the Teager
energy, we compute the inverse discrete cosine transform of
its power spectral density.
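As an illustration of the operator itself, the sketch below uses the standard discrete-time definition, psi[n] = x[n]^2 - x[n-1]*x[n+1], written in Python with no external libraries. The function name and the test tone (7 kHz at a 48 kHz sampling rate, amplitude 0.5) are assumptions made for the example and are not taken from our processing pipeline. For a pure tone A*cos(omega*n) the operator returns the constant A^2*sin^2(omega), so the output grows with both amplitude and frequency.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values, one per interior sample.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative tone; amplitude and frequency are arbitrary choices.
amplitude = 0.5
omega = 2 * math.pi * 7000 / 48000  # 7 kHz tone sampled at 48 kHz
tone = [amplitude * math.cos(omega * n) for n in range(480)]

psi = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2  # closed form for a pure tone
```

Because the operator needs only three neighbouring samples, it can respond to changes in amplitude or frequency faster than a windowed spectral estimate.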

All of these features have the desirable property of 
capturing characteristics of the signal that are useful in both 
identifying and distinguishing marmoset vocalizations. 
Furthermore, they can be easily extracted in an automated
manner, unlike the features described in more common
approaches [10]. The relative importance of each of these
feature sets will be discussed in Section 3. 

2.2. Detection 

Since the detector must make many decisions for every second 
of audio data provided, we select features that have low 
dimensionality and are computationally efficient. We use a set 
of TEO-based features for our detector. From the framed
signals (frame length = 500 ms, step = 50 ms), we extract the
signal energy, the mean Teager energy, and the peak
amplitude and frequency of the power spectral density of the
Teager energy. Using these features, we train a simple
feed-forward neural network containing one hidden layer of 3
neurons to obtain the likelihood that each frame contains a 
vocalization. These likelihood predictions are then converted 
to binary predictions using a threshold, which controls the 
sensitivity of the detector. 

Once each frame has been labeled as either vocalized (0)
or non-vocalized (1), we merge these decisions in the
following manner. Consider each frame to be a candidate
vocalization. We first merge any vocalized frames with fewer
than K_1 non-vocalized frames between them into
the same candidate vocalization. This is done in order to
prevent strings of vocalizations, such as those found in phees
and twitters, from being considered as multiple separate
vocalizations. We then reject any candidate vocalizations
containing fewer than K_2 vocalized frames. These
candidates are deemed too short in duration to model the types
of vocalizations that we are interested in classifying.
Increasing K_1 will increase the likelihood of merging separate
vocalizations, while decreasing K_1 will raise the likelihood of
splitting a single vocalization into multiple predicted
vocalizations. K_2 can be adjusted to control the precision and
recall of the detector. A lower K_2 will lead to greater
sensitivity and the ability to detect shorter-duration
vocalizations, but will also increase the false alarm rate.
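The merging and rejection rules can be sketched as follows. This is a minimal Python illustration and not the implementation used in our experiments: the function name, the use of booleans (True for a vocalized frame), and the example thresholds are assumptions, with k1 and k2 standing for the gap threshold and the minimum vocalized-frame count described above.

```python
def merge_frames(vocalized, k1, k2):
    """Group per-frame decisions into candidate vocalizations.

    vocalized: list of booleans, True where the detector flagged the frame.
    k1: runs of vocalized frames separated by fewer than k1 non-vocalized
        frames are merged into one candidate.
    k2: candidates containing fewer than k2 vocalized frames are rejected.
    Returns a list of (first_frame, last_frame) index pairs, inclusive.
    """
    # Step 1: collect maximal runs of consecutive vocalized frames.
    runs = []
    start = None
    for i, flag in enumerate(vocalized):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append([start, i - 1, i - start])  # first, last, count
            start = None
    if start is not None:
        runs.append([start, len(vocalized) - 1, len(vocalized) - start])

    # Step 2: merge runs whose separating gap is shorter than k1 frames.
    merged = []
    for first, last, count in runs:
        if merged and first - merged[-1][1] - 1 < k1:
            merged[-1][1] = last
            merged[-1][2] += count
        else:
            merged.append([first, last, count])

    # Step 3: drop candidates with too few vocalized frames.
    return [(first, last) for first, last, count in merged if count >= k2]


flags = [bool(v) for v in [0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
                           0, 1, 0, 0, 0, 0, 1, 1, 1, 1]]
candidates = merge_frames(flags, k1=3, k2=3)
```

With these example settings the one-frame gap at index 3 is bridged, the isolated frame at index 11 is rejected, and two candidates remain. Raising k1 to 5 merges everything into a single candidate, which shows the over-merging risk noted above.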

2.3. Classification 

The classification module presented here aims to classify four 
vocalizations (trill, phee, twitter, and chatter) and one 
additional category for all other acoustic events. We start with
the large set of candidate features described in Section 2.1 in
order to capture spectral-temporal information that is helpful
in discriminating between any pair of vocalizations. While
using a large set of features maximizes the chances of
identifying useful variables, directly modeling in
high-dimensional spaces yields overly complex models that are
prone to overfitting. To
avoid this problem, we iteratively select the top 20 features 
using a forward selection algorithm designed to minimize the 
non-parametric upper bound on the Bayes error described in 
[25]. This approach outperformed feature selection by the 
parametrically estimated Bhattacharyya bound. Once the
optimal subset of features has been identified, we use
error-correcting output codes [26] to generate different
multi-class models for standard binary learners: SVMs, naive
Bayes classifiers, decision trees, and discriminant analysis.
Analysis of the performance of these different binary learners
is presented in Section 3.4.
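The greedy structure of forward selection can be sketched in a few lines of Python. The selection loop below is generic; the criterion shown is only a stand-in (leave-one-out nearest-neighbour error) and is not the non-parametric Bayes error bound of [25]. All function names and the toy data are hypothetical.

```python
def forward_select(num_features, criterion, k):
    """Greedy forward selection: repeatedly add the feature that gives the
    lowest criterion value when joined with the features already chosen."""
    selected = []
    remaining = list(range(num_features))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def loo_nn_error(samples, labels):
    """Stand-in criterion (an assumption, not the bound from [25]):
    leave-one-out 1-nearest-neighbour error on the chosen feature columns."""
    def criterion(cols):
        errors = 0
        for i, xi in enumerate(samples):
            nearest = min(
                (j for j in range(len(samples)) if j != i),
                key=lambda j: sum((xi[c] - samples[j][c]) ** 2 for c in cols),
            )
            errors += labels[nearest] != labels[i]
        return errors / len(samples)
    return criterion


# Toy data: feature 0 separates the two classes; features 1 and 2 do not.
samples = [
    [0.0, 5.0, 0.1], [0.1, 1.0, 0.9], [0.2, 3.0, 0.2], [0.3, 4.0, 0.3],
    [1.0, 2.0, 0.8], [1.1, 5.0, 0.7], [1.2, 1.0, 0.1], [1.3, 3.0, 0.9],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
criterion = loo_nn_error(samples, labels)
chosen = forward_select(3, criterion, k=2)
```

On the toy data the informative feature is picked first. Any other separability estimate can be swapped in for the criterion without changing the selection loop.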

3. Results

A common challenge in automated animal vocalization
classification is the limited availability of labeled data. To
overcome this limitation, we analyze the system performance on
semi-synthetic data generated using the procedure outlined in
Section 3.2. The augmented truth data greatly enhanced the
system development and validation. While the training and 
testing data sets for the detector and classifier are generated 
using the same procedure, the vocalization samples selected 
for each process are distinct. 

3.1. Experimental setup 

We collected vocalizations from two adult marmoset monkeys 
housed together in their home cage (~1 x 1 x 2 m), which is 
located in a large animal room with ~10 other marmoset cages. 
At the time of recording the pair had been together for about 
one year. The subjects moved freely inside their home cage. A 
small voice recorder (PanicTech, 8GB digital recorder, 46 x 5
x 18 mm, 6.9 g) was embedded into a soft silicone-based
collar and was worn around each subject’s neck. The sampling
rate was 48 kHz. Each recording session lasted about 1 hour,
after which the collars were taken off. All animal procedures
were performed in accordance with National Institutes of Health
guidelines and were approved by the Massachusetts Institute of
Technology Committee on Animal Care. The audio files were
uploaded to a computer and aligned using Audacity
(http://www.audacityteam.org) and further analyzed in MATLAB
(MathWorks, Natick, MA).

3.2. Data Augmentation 

Labeled data is essential for both the training and evaluation of
the proposed model. However, because acquiring a large
number of accurate labels in this domain requires significant
time from trained analysts, it has been a challenge to obtain
sufficient labeled vocalizations. Data augmentation is a
common approach in machine learning to overcome this
constraint [27][28]. We have developed an approach that
takes a small set of sample vocalizations (the call dictionary)
and augments it into a large dataset with background noise and
other acoustic events that replicates the acoustic
characteristics of a continuous stream of labeled audio data.
The call dictionary used in the experiments contains 24 phee
calls, 31 trill calls, 21 twitter calls, 6 chatter calls, and 69
other acoustic events.

To generate augmented audio streams for the detector, we
first replicate the background noise found throughout our
sample recordings by identifying segments of audio that are
free from vocalizations or other acoustic events. To create a
new audio noise stream, starting at the first second of the file
we perform the following:

1. Randomly select 1 second of noise from the sample file.

2. Multiply this noise signal by a triangular window, and
add it to the current audio segment.

3. Step forward half a second.

4. Repeat steps 1-3 until reaching the end of the audio file.

The result is a continuous stream of noise of arbitrary
length that closely models that found in the real recordings.
Next we populate the noise stream with vocalizations by
randomly selecting vocalizations and acoustic events from the
call dictionary and adding them at random indices to the
background noise. The acoustic events are drawn from a set of
sample events, such as cage rattling noises and noise from
marmosets scratching their necks, found in the original audio
streams. CV placement is restricted so that no new
vocalizations are placed on top of previous ones. Once all
vocalizations have been placed, the resulting audio stream is
used to train the detector. Note that for evaluation we partition
our call dictionary such that only part of it is used in training
and the remainder is used to generate the test data.
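The two stages of this procedure, overlap-adding windowed noise segments and placing events without overlap, can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions rather than the code used to build our dataset: the function names, the rejection-sampling strategy for placement, the max_tries limit, and the placeholder signals are all hypothetical, and the sketch starts at sample 0 and works in samples rather than seconds.

```python
import random


def triangular_window(length):
    """Symmetric triangular window that rises from near 0 to near 1 and back."""
    half = length / 2
    return [1 - abs((n + 0.5) - half) / half for n in range(length)]


def synthesize_noise(noise_sample, out_len, seg_len, rng):
    """Overlap-add randomly chosen, triangular-windowed noise segments,
    stepping forward by half a segment each time."""
    window = triangular_window(seg_len)
    out = [0.0] * out_len
    hop = seg_len // 2
    for start in range(0, out_len - seg_len + 1, hop):
        src = rng.randrange(len(noise_sample) - seg_len + 1)
        for n in range(seg_len):
            out[start + n] += window[n] * noise_sample[src + n]
    return out


def place_events(stream, events, rng, max_tries=1000):
    """Add each event at a random offset, rejecting offsets that would
    overlap an event that has already been placed.

    Returns (start, end) sample ranges, end exclusive, in placement order.
    An event is skipped if no free offset is found within max_tries draws.
    """
    placed = []
    for event in events:
        for _ in range(max_tries):
            start = rng.randrange(len(stream) - len(event) + 1)
            end = start + len(event)
            if all(end <= s or start >= e for s, e in placed):
                for n, value in enumerate(event):
                    stream[start + n] += value
                placed.append((start, end))
                break
    return placed


rng = random.Random(0)  # fixed seed so the sketch is repeatable
noise_sample = [rng.uniform(-0.01, 0.01) for _ in range(4800)]
stream = synthesize_noise(noise_sample, out_len=48000, seg_len=4800, rng=rng)
calls = [[0.5] * 2400, [0.3] * 4800]  # placeholder "vocalizations"
ranges = place_events(stream, calls, rng)
```

With a half-segment step, adjacent triangular windows sum to one, so the gain of the synthesized noise stays constant away from the two ends of the stream. The returned sample ranges serve as the ground-truth labels for the placed events.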

3.3. Vocalization detection results 

Our detection module was tested using the semi-synthetic
audio streams described in the previous section. We generate
separate 10-minute segments of audio for both training and
evaluation, and populate each audio segment with 10
vocalizations from each call type, along with additional
acoustic events that represent non-vocal events such as cage
rattling or noise from animals scratching their necks. We vary
the number of acoustic events in order to better understand the
influence of these events on the system's performance. We then
evaluate the performance of the detector using true positive 
rate (TPR), which is the ratio of true positives over the sum of 
true positives and false negatives, and false positive rate 
(FPR), which is the ratio of false positives over the sum of true 
negatives and false positives. 

The metrics are calculated by considering each frame as a 
separate detection problem. Figure 2 is a plot of the receiver
operating characteristic (ROC) curve resulting from each trial
of this experiment. The ROC curve clearly illustrates the 
trade-off between detection rate and false-alarm rate, and 
shows the impact of acoustic events on the system 
performance. 
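As a worked example of the frame-level metrics, the Python sketch below computes TPR and FPR from per-frame ground truth and detector likelihoods, and sweeps the decision threshold to trace points on an ROC curve. The function names and the ten-frame toy data are assumptions for illustration, and the code assumes the stream contains at least one vocalized and one non-vocalized frame.

```python
def frame_rates(truth, predicted):
    """Frame-level true positive rate and false positive rate.

    truth, predicted: equal-length lists of booleans (True = vocalized).
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(truth, likelihoods, thresholds):
    """One (FPR, TPR) point per decision threshold."""
    points = []
    for threshold in thresholds:
        predicted = [value >= threshold for value in likelihoods]
        tpr, fpr = frame_rates(truth, predicted)
        points.append((fpr, tpr))
    return points


# Toy data: four vocalized frames followed by six non-vocalized frames.
truth = [True] * 4 + [False] * 6
likelihoods = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
curve = roc_points(truth, likelihoods, thresholds=[0.85, 0.5, 0.25])
```

Lowering the threshold moves the operating point up and to the right along the curve: more vocalized frames are detected at the cost of more false alarms.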

3.4. Classification results 

We evaluate our classification module from three perspectives:
(1) performance of the different classifiers, (2) performance 
vs. the size of the call dictionary, and (3) which feature sets 
provide the most utility in discriminating between the various 
call types. 

Figure 2: Detection/false alarm tradeoffs with
increasing number of noise events.

To evaluate the classifiers, we generate synthetic training
and test vocalizations via the procedure outlined in Section
3.2. To analyze the dependency of the system on the size of
the call dictionary, we vary the fraction of vocalizations used
for training vs. testing from 20% to 50%, and then generate a
total of 2000 instances (400 per category) each for the
training and test data. Once the training and test vocalizations
are generated, we iteratively select the top 20 features using a
forward selection algorithm designed to minimize the
non-parametric upper bound on the Bayes error described in [25].
We then use error-correcting output codes [26] to generate
different multi-class models for standard binary learners
including SVMs, naive Bayes classifiers, decision trees, and
discriminant analysis. We evaluate the performance of each of
these classifiers on the test data for each partition of the call
dictionary at every feature subset. These results are then
averaged across a 25-iteration Monte Carlo simulation, and the
average and standard error of the classification error rates are
displayed in Figure 3. Though we tested smaller feature
subsets, we observed that the performance of most classifiers
approaches its asymptote by 20 features, so we present only the
results of classifiers constructed on 20 features.
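For reference, the averaging step can be written out as below. This Python sketch assumes the usual definition of the standard error of the mean (sample standard deviation divided by the square root of the number of trials); the function name is hypothetical and the five error rates are placeholders, not values from our experiments.

```python
import math


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(sample_variance / n)


# Hypothetical error rates from five trials (placeholders, not paper results).
trial_errors = [0.16, 0.14, 0.15, 0.17, 0.13]
mean_error, standard_error = mean_and_standard_error(trial_errors)
```

The standard error shrinks with the square root of the number of trials, which is why the error bars narrow as Monte Carlo iterations are added.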

From Figure 3, we see that the performance of the classifier is
dependent upon the size of the call dictionary.

Figure 3: Comparison of the classification errors (%) from
four different methods given different CV dictionary sizes.
Error bars are standard errors.

Due to the dramatic improvements in performance at each
increment of dictionary size tested, we hypothesize that the
performance with respect to the dictionary size is not close to
its asymptote. However, we are unable to test this hypothesis at
any larger sizes, as allocating more than 50% of the CV
dictionary to training impairs our ability to estimate the
out-of-sample performance of each classifier. Additionally,
while none of the binary learners showed a statistically
significant advantage over the others, we found that the
decision trees performed best for smaller dictionary sizes (20%
and 30%), while the SVM learner yielded the highest
performance for larger dictionaries (40% and 50%).









To better understand the cause of these errors, we can
look at the confusion matrix in Table 1, which is drawn from a
single trial of this classification experiment. This matrix shows
that the majority of the mistakes made by the proposed model
come from confusion between twitters and chatters and
confusion between chatters and other events. Because both
twitters and chatters are calls containing periodic bursts of
energy, the confusion between them is not surprising and
indicates a need for features that better capture the short-term
spectral structure in the twitter. Confusion between the
chatters and other acoustic events likely stems from the
difficulty in distinguishing chatters from the noise resulting
from the marmosets scratching their collars, as the two are
similar. This difficulty can be alleviated by the integration of
data from additional microphones located outside of the cage.
Increasing the number of chatters in the call dictionary could
also result in a more robust representation of them.


Table 1. Confusion Matrix

True\Predicted   Phee   Trill   Twitter   Chatter   Other
Phee              367       4        29         0       0
Trill               6     385         5         0       4
Twitter             3       1       337        47      12
Chatter             0       0         0       289     111
Other              10       6         7        25     352
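The counts in Table 1 can be summarized with a few lines of Python. The sketch below copies the matrix from the table (rows are true classes, columns are predicted classes) and derives the overall error rate and the per-class recall for this single trial; the variable names are illustrative.

```python
labels = ["Phee", "Trill", "Twitter", "Chatter", "Other"]

# Rows are true classes, columns are predicted classes (Table 1).
confusion = [
    [367, 4, 29, 0, 0],
    [6, 385, 5, 0, 4],
    [3, 1, 337, 47, 12],
    [0, 0, 0, 289, 111],
    [10, 6, 7, 25, 352],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
error_rate = 1 - correct / total

# Recall: fraction of each true class that was labeled correctly.
recall = {
    labels[i]: confusion[i][i] / sum(confusion[i])
    for i in range(len(labels))
}
weakest = min(recall, key=recall.get)
```

For this trial, 1730 of the 2000 test instances are classified correctly, an error rate of 13.5%. Chatter has the lowest recall (289 of 400), with its remaining 111 instances assigned to the other-events category.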


To better understand the relative significance of each
grouping of features, a second experiment is conducted where
the feature set is limited to a specific group of features (Figure
4). This experiment is identical to the previous one with a few
exceptions. The size of the training dictionary is held constant
at 50% and we instead vary the base feature set. Only 5 or 10
features are selected rather than 20, because the Audio
Toolbox only contains 10 features total. We find from this
experiment that the features from the Audio Toolbox yield the
highest individual performance among the three feature sets,
though they only slightly outperform the MFCC grouping.
When we look at combinations of feature sets, we find that the
performance of the Audio Toolbox and MFCC features
significantly improves when grouped together, and while the
Teager features do not improve the performance when added to
either of the other sets, they yield a small boost when added to
their combination.

Figure 4: Performance comparison of individual feature
sets. Error bars are standard errors.

4. Discussion 

This paper represents a preliminary effort in the
development of a system to automatically monitor continuous
audio data for marmoset vocal behavior. We have focused
primarily on evaluating and tuning the classification model,
since the classifier can make up for deficiencies in the
detection system: the detector can be operated in the
high-detection region, with the classifier weeding out the
resulting large number of false positives. While the proposed
system exhibits relatively high performance in our evaluations
thus far, there remains significant work in refining the design
and evaluation of the proposed model. Many aspects of this
system may be improved with the availability of additional
data, which will allow the use of more sophisticated models for
both the detection and classification modules.

Furthermore, while the spectral plots based on the Teager
energy shown in Figure 5 provide a representation that is
visually distinctive for each vocalization type, the features
extracted based on this representation have not significantly
improved performance in our evaluations thus far. Further
research is necessary for more effective use of the TEO in this
domain.

It is also worth noting that we only consider four
categories of vocalizations in this paper, which represent a
small subset of the marmoset’s entire vocal repertoire. Since
the architecture is modular, we can easily extend the system to
include a broader set of vocalizations.




Figure 5: Spectral plots based on the Teager energy,
extracted from the four vocalizations shown in Figure 1.

5. Conclusions 

This paper presents a novel framework for automated
marmoset vocalization detection and classification. Three
major components of the system are described: automated
feature extraction for analyzing the marmoset audio data
collected in the home cage, the detection module for identifying
vocalizations from noisy audio streams, and the classification
module for discriminating between four different vocalization
types. The proposed system performs well experimentally, with
an 80% detection rate and a 20% false alarm rate on data with a
high number of noise events and a classification error of 15%.
The architecture is flexible and can be extended to a larger
number of vocalizations. We believe that such an automated
system has the potential to greatly improve primate welfare
monitoring and behavioral analysis.